_____________________________________________________________________________

 Copyright 1994, Silicon Graphics, Inc.  All Rights Reserved.

 THIS DOCUMENT CONTAINS UNPUBLISHED INFORMATION OF SGI

 The copyright notice above does not evidence any actual or intended
 publication or disclosure of this document, which includes information
 that is confidential and/or proprietary, and is a trade secret, of
 Silicon Graphics, Inc.  ANY DUPLICATION, MODIFICATION, DISTRIBUTION,
 PUBLIC PERFORMANCE, OR PUBLIC DISPLAY OF THIS DOCUMENT OR ANY PORTION
 OF THIS DOCUMENT, WITHOUT THE EXPRESS WRITTEN CONSENT OF SILICON
 GRAPHICS, INC. IS STRICTLY PROHIBITED.  THE RECEIPT OR POSSESSION OF
 THIS DOCUMENT DOES NOT CONVEY ANY RIGHTS TO REPRODUCE, DISCLOSE OR
 DISTRIBUTE ITS CONTENTS, OR TO MANUFACTURE, USE, OR SELL ANYTHING THAT
 IT MAY DESCRIBE, IN WHOLE OR IN PART.
_____________________________________________________________________________

 ~4Dgifts/toolbox/src/exampleCode/networking/HIPPI/HIPPIperf.txt

                     HIPPI Performance on IRIX 5.2

                            by Paul Reilly
                            MSD Marketing
                        Silicon Graphics, Inc.
                      Mountain View, California

                            June 24, 1994

SUMMARY:  This white paper explores the performance envelope of Silicon
Graphics IRIS HIPPI version 1.0 as run on the CHALLENGE(TM) and Onyx(TM)
computer systems running IRIX version 5.2.  This report is written for
the HIPPI expert.

Acknowledgments:  We wish to thank the following people, who contributed
to this white paper (in alphabetic order):  Scott Bovenizer, Lise
Garrett, Thomas Skibo, Rob Warnock, and Audy Watson.

IRIS, Silicon Graphics, and the Silicon Graphics logo are registered
trademarks, and CHALLENGE, Onyx, POWER Channel, and IRIX are trademarks,
of Silicon Graphics, Inc.  NFS is a registered trademark of Sun
Microsystems, Inc.  UNIX is a registered trademark of UNIX System
Laboratories, Inc.

Introduction.
=============

Once upon a time, we had one of those `muscle' cars with a 427 cubic
inch engine, four-on-the-floor, traction control, and a radio blaring
Beach Boys music at 110 dB.  Of course, that was a very long time ago,
back when the Beach Boys were still boys.  Much has changed in the
intervening years.  Those originals gave way to a series of
progressively more `sensible' cars, until our present vehicle is barely
capable of exceeding the legal speed limit.  Yet the memories linger of
the weekends at the drag strip, the roar of powerful engines, the smell
of burning rubber, the dreams of breaking a ten-second elapsed time....

While not quite as exciting as burning rubber for a hundred yards, we
just had the opportunity to relive a little of that excitement.  Silicon
Graphics has recently announced a new IRIS HIPPI card for the
CHALLENGE(TM) and Onyx(TM) families of computer systems, and we were
asked to see just what it can do.  A one-word summary of the results:
Awesome!

The IRIS HIPPI Interface:

A detailed description of the IRIS HIPPI interface is beyond the scope
of this white paper--the documentation does an excellent job.  If you
are interested in the implementation details, we recommend that you
obtain a copy of the IRIS HIPPI Administrator's Guide, Document Number
007-2229-002, which gives a fairly detailed description of the board's
hardware implementation.  Of equal importance is the IRIS HIPPI API
Programmer's Guide, Document Number 007-2227-001.  It describes the
various software interfaces available to the IRIS HIPPI interface,
including the HIPPI-PH layer, which equates to raw mode I/O.  You can
get copies of this documentation from your local Silicon Graphics sales
representative.

Since most of you are probably interested in the HIPPI-PH interface, we
have attached example programs which use it in the appendices of this
white paper.  These are sink and blast.  Neither is a complete, finished
program.
They are merely working examples, or so-called scrub programs, which
demonstrate how one should actually program the HIPPI-PH interface of
IRIS HIPPI.  However, they are working programs which you can try out on
your IRIS HIPPI cards and tinker with to see just what performance you
can get.  In fact, we used them as part of the testing that went into
this report.

The Machines:
=============

Naturally enough, we should start this performance report with a
description of the equipment used.  The systems used were a base model
CHALLENGE L and a deskside Onyx, each with two 150 MHz processors, 256
MB of memory (one-way interleaved), and a POWER Channel(TM) 2 (IO4).
One HIO port had a SCSI adapter installed, while the other HIO port had
the IRIS HIPPI card attached to it.  The two systems were directly
connected by HIPPI cables, with no switch in between.

The disk subsystem consisted of six fast-and-wide SCSI disks attached to
each system as a six-way-striped logical volume, with the disks striped
three each on each of two controllers.  Thus, the two systems were
fairly common configurations, ones that you yourself are likely to have.
As we will see, the disk configuration was the weak link--the IRIS HIPPI
card was easily able to stay well ahead of this disk configuration.

The Software:

Each system was loaded with the released version of IRIX(TM) 5.2 and
version 1.0 of the IRIS HIPPI driver and related software.  The only
software tuning was that the default TCP window/socket space was
increased to 512 KB.  That is, in /var/sysgen/master.d/bsd,

    unsigned long tcp_sendspace = 60 * 1024;    /* must be < 256K */
    unsigned long tcp_recvspace = 60 * 1024;    /* must be < 256K */

was changed to

    unsigned long tcp_sendspace = 512 * 1024;   /* must be < 256K */
    unsigned long tcp_recvspace = 512 * 1024;   /* must be < 256K */

Please note that the comments on these two lines are incorrect; the
actual maximum size is 512 KB.
On to the Drag Strip:
=====================

Naturally, whenever you have a shiny new hot rod, the first thing you
want to do is see just how fast it really is--so you take it to the
local drag strip and burn rubber.  The UNIX equivalent of the drag strip
is raw mode I/O, which in the case of IRIS HIPPI translates into
HIPPI-PH.  As noted above, we used blast and sink to wring this
interface out.  They are attached in the appendices, and will be made
available on a forthcoming developers' toolbox.

If you read the listings, you will find that blast, the transmitting
program, has several options.  The first is -1 or not.  This controls
whether it forks into two processes (-1 turns forking off).  Since there
are latency issues that affect performance, such as doing the mpin and
munpin of physical memory, it is desirable to run at least two
interleaved processes if possible.  Since this interface runs only on a
multiprocessor computer (there are always at least two CPUs), it makes
sense to do this.  Since we were using only one HIPPI interface per
system and no switch, the -D and -I switches, which define the device
name and ifields, are not relevant.  The final three options are the
length of the write performed, the number of packets to be sent, and the
number of times to repeat the test.

Various write lengths were evaluated in powers of two from 256 bytes to
2 MB.  The upper limit of 2 MB was set by the amount of hardware
buffering on the IRIS HIPPI board, which can handle a maximum of 2 MB.

Table 1 shows the results of running sink and blast over the IRIS HIPPI
PH interface.  They are in megabytes per second as reported by blast.
The left-most column is the write size (in bytes) used.  The second
column shows the results when -1 was not used; that is, there was a fork
and subtasking was used.  The third column shows the results when -1 (no
forking) was used.
    I/O Size    (forking enabled)   (forking disabled)
    (bytes)          (MB/s)              (MB/s)

        256           0.32                0.32
        512           0.64                0.66
       1024           1.36                1.36
       2048           2.64                2.71
       4096           5.28                5.13
       8192           9.52                9.52
      16384          16.62               16.98
      32768          22.15               24.16
      65536          33.60               31.88
     131072          46.30               37.88
     262144          61.88               50.61
     524288          70.42               62.19
    1048576          74.29               70.13
    2097152          77.94               73.21

        Table 1  Results of running sink and blast over the
                 IRIS HIPPI PH interface

As you can see, the size of the I/O clearly makes a difference.  The
shorter I/O lengths gave fairly poor performance until the 512 KB size
was used.  At that point, the performance gradually increased until the
maximum of 78 MB/second was reached with 2 MB writes and reads.

The second observation is that enabling forking to share the CPU compute
load between two processors is a win.  The performance difference
between the two modes of running blast and sink is slight until you
start using I/O with lengths greater than 64 KB.  After that, there is a
five to ten percent gain with forking enabled.

One point not shown in Table 1 is that the IRIS HIPPI interface can
actually transmit faster than it can receive, and so the results in
Table 1 are the receive performance data.  If you are interested in the
transmit performance of IRIS HIPPI, you can let the interface run free
by simply not starting the sink task on the receiving system.  If you
were to do this, then the rate for 2 MB writes (using blast alone) is
close to 92 MB/second.

Cruising the TCP/IP Expressway:
===============================

While HIPPI-PH is quite useful for a number of applications, all of them
must be specially written to use it.  There are, however, a large number
of common programs which use the IP networking stack.  Among them are
rcp and ftp, as well as NFS.  We will start with NFS.  NFS is a
request/response protocol in which the client sends a request and then
waits until the response (usually just an 8 KB block of data) is
fulfilled.
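A back-of-the-envelope model shows why such a synchronous
request/response pattern caps throughput regardless of link speed.  In
this sketch, the function name and the 10 ms round-trip figure are our
illustrative assumptions, not measured values:

```c
#include <assert.h>

/* Synchronous request/response throughput: only one block is in
 * flight, so the wire moves one block per round trip, no matter how
 * fast the underlying medium is. */
double sync_throughput(double block_bytes, double rtt_seconds)
{
    return block_bytes / rtt_seconds;
}
```

One 8 KB block every 10 ms works out to roughly 800 KB/second, which is
about the figure NFS has historically delivered; a faster medium shaves
only the transmission time, not the per-request wait.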
History has shown that NFS is self-limiting, generally on the order of
800,000 bytes per second, which is roughly the bandwidth of ethernet.
When FDDI first came out, everybody was surprised that NFS performance
was only about 10% faster on FDDI than it was on ethernet.  The NFS
implementation is being further developed to improve its performance on
these higher bandwidth media.  These enhancements will be incorporated
into version 5.3 of IRIX.

On the other hand, TCP is well suited to high speed interfaces such as
HIPPI.  The reason is that it uses a sliding-window buffer to keep the
data flowing, almost as though there were a conveyor belt set up between
the sender and receiver.  It also has a number of very clever
optimizations.  One is `slow-start', which starts the data transfer over
TCP at a fairly low rate and gradually (for a computer, at least)
increases the flow of data until either it finds that it has lost a
packet or it reaches full throttle.  There is also something called MTU
discovery, which permits two TCP entities to negotiate the size of the
packets to be sent between them.  Thus, TCP is self-governing, adapting
itself to conditions on the network.

Tuning Silicon Graphics' TCP for Speed.

Recently, RFC 1323 was published, which includes additional TCP
performance improvements.  However, some of them, such as large window
sizes, will actually cause performance losses on older systems that
don't have them.  So, to keep the peace, we ship IRIX with these
features in it and then turn them off, so as to keep compatibility with
older versions of TCP.  The controlling variables can be found in
/var/sysgen/master.d/bsd.  They read:

    /* TCP window sizes/socket space reservation */
    unsigned long tcp_sendspace = 60 * 1024;    /* must be < 256K */
    unsigned long tcp_recvspace = 60 * 1024;    /* must be < 256K */

    /* TCP large windows (RFC 1323) control. */
    int tcp_winscale = 1;
    int tcp_tsecho = 1;

The last two variables are set to true and should be left that way.
The way we actually control whether RFC 1323's features are turned on is
by the window sizes.  If they are set to 60 * 1024 or less, then the
assumption is that the RFC 1323 features should not be used.  On the
other hand, if the window size is set to greater than 60 KB, then they
should be used.

There are two ways of doing this.  First, you can, as we did, edit the
file and increase tcp_sendspace and tcp_recvspace.  And as noted above,
the correct upper limit is 512 KB, not 256 KB.  What this does is set
the default TCP window sizes.  You can also set the window sizes with a
setsockopt call.  This is handy in that you can selectively increase the
TCP window size for a particular interface such as HIPPI without
increasing it for all your networking interfaces.  We will get back to
this in a moment when we discuss ttcp below.  The reason we chose to
edit the /var/sysgen/master.d/bsd file was that some programs, such as
ftp, do not make this setsockopt call, which forces you to increase the
window sizes by changing the defaults defined in
/var/sysgen/master.d/bsd.

Memory-to-Memory TCP/IP Performance over HIPPI.

For some time now, ttcp has been the acknowledged program for testing
TCP/IP performance.  It started out in the early 1980s as a program from
BRL, then followed several paths to virtually every UNIX implementation
that supports TCP/IP (which is virtually all of them).  The particular
version used in this test is the one found in the optional
eoe2.sw.ipgate subsystem that can be selected during inst.  The source,
or a reasonable facsimile, is distributed in
/usr/people/4Dgifts/examples/network, so if you don't have it on your
system, or you would like to play with it, you can find it there.

While ttcp can be used to test UDP performance as well, we limited
ourselves to the TCP/IP mode.  ttcp can also be used to do disk reads
and writes, but most people prefer to use rcp or ftp for that, so we
used ttcp solely in its memory-to-memory mode.  That is, we used the -s
switch.
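For programs you do control, the per-socket alternative mentioned above
is a setsockopt call on SO_SNDBUF and SO_RCVBUF.  A minimal sketch (the
helper name is ours, not part of any SGI library) looks like this:

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Enlarge the TCP windows for one socket only, instead of changing the
 * system-wide defaults in /var/sysgen/master.d/bsd.  Returns 0 on
 * success, -1 on failure. */
int set_tcp_windows(int sock, int bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   (char *)&bytes, sizeof bytes) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   (char *)&bytes, sizeof bytes) < 0)
        return -1;
    return 0;
}
```

Call it right after socket() and before connect() or listen(), e.g.
set_tcp_windows(s, 512 * 1024); setting the buffers before the
connection is established lets TCP use the large windows for that
socket alone.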
A few words of warning need to be made about both ttcp and TCP/IP.
First, ttcp can and does use all those neat features that make TCP/IP
run fast, including slow-start, MTU discovery, and a host of others.  As
noted, TCP/IP will spend several seconds when a transfer is first
started to `feel' out the network and determine just how fast it can
run.  As a result, very short runs--runs of less than 10 seconds--give
spuriously low performance.  As a rule of thumb, you should never make a
ttcp run of less than 30 seconds.  We prefer 60 seconds as a minimum.
Thus, you should always set the -n switch to some number which gives a
run that lasts at least 30 seconds.  Otherwise, you will not have valid
results.  Now for those results.

The ttcp Results.
=================

We ran ttcp in TCP mode and memory-to-memory mode only.  That is to say,
the generic ttcp command for receive was

    ttcp -r -s -l####

while the generic ttcp command for transmit was

    ttcp -t -s -l#### -n100000 -b524288 [-D] hippi-hostname

The values for -l#### varied from 256 to 65536 bytes by powers of 2,
with the exception of 61440, which has a special meaning that we will
explain in a minute.  For all buffer sizes except 61440 and 65536, we
used -n100000.  In the case of -l61440 and -l65536, -n50000 was used
instead.  Thus, all tests ran for at least 17 seconds, while the longer
lengths ran for about one minute.  The -b524288 option sets the TCP
window size to 524288 (or 512 KB), which could be used in place of the
changes to /var/sysgen/master.d/bsd noted above (-b sets the socket
buffer size of SO_SNDBUF/SO_RCVBUF with a setsockopt call).  In
addition, we used -D to set TCP_NODELAY on some of the tests, to see
what impact this feature would have on performance.

Table 2 shows the results of running ttcp in memory-only mode over IRIS
HIPPI using TCP/IP.  The various options for ttcp were as noted above.
The results are in kilobytes per second (i.e., units of 1024 bytes) as
reported by ttcp.
The results for both the transmitting and receiving tasks of ttcp are
shown.

                      -D on                    No -D
    I/O Size    Receive    Transmit      Receive    Transmit
    (bytes)    (kbytes/s) (kbytes/s)    (kbytes/s) (kbytes/s)

        256       1055       1055          1587       1588
        512       2463       2464          3498       3500
       1024       5803       5805          6208       6211
       2048      12143      12160         11786      11791
       4096      22001      22007         32626      32654
       8192      32972      32984         40666      40681
      16384      38756      38763         41334      41341
      32768      42235      42240         42908      42913
      61440      47192      47196         46213      46216
      65536      42100      42103         41992      41997

        Table 2  Results of running ttcp in memory-only mode over
                 IRIS HIPPI using TCP/IP

The results show that one can get very good TCP/IP performance over IRIS
HIPPI.  The maximum performance was about 47 MB/second.  Not
surprisingly, the performance increased as the size of the read/write
increased, with the exception of 60 KB (61440), which is clearly the
sweet spot in this series of tests.  The performance then falls off
slightly at 64 KB.

This is not unexpected.  Remember that IP has a 16-bit length field, and
so the size of a TCP/IP packet cannot be larger than 64 KB.  However,
since you still have to wrap the TCP portion of the transmitted frame
with IP headers and trailers, you cannot really send a 64 KB TCP packet
without fragmentation.  Thus, 60 KB is the sweet spot, as it is the
largest chunk of user data that can be sent inside a 64 KB TCP/IP
encapsulation and still be a multiple of 4 KB, the page size.  We
therefore expected the maximum performance to be at 60 KB, because it
fully utilizes the available space inside a TCP/IP packet and avoids any
packet fragmentation.  The reason the results for 64 KB buffers were
worse than even the 32 KB buffers is that fragmentation was taking
place.

Naturally, this leads to the question of what happens if you fail to use
a buffer that is a multiple of 4 KB.  Well, on the transmission side,
the answer is not much, for it is still possible to `page flip' the
output buffer down to the device driver and so avoid copying data.
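The sweet-spot arithmetic can be written down directly.  In this sketch,
the function name and the 40-byte header figure (20 bytes of IP header
plus 20 bytes of TCP header, with no options) are our illustrative
assumptions:

```c
#include <assert.h>

/* Largest write that both fits in a single IP datagram (after header
 * overhead) and is a whole number of virtual-memory pages. */
long tcp_sweet_spot(long max_datagram, long headers, long pagesize)
{
    return ((max_datagram - headers) / pagesize) * pagesize;
}
```

With max_datagram = 65535 (the 16-bit IP limit), headers = 40, and
pagesize = 4096, this yields 61440 bytes: exactly the 60 KB buffer size
that topped Table 2.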
However, on the receive side, you cannot do this indiscriminately, as it
is possible to destroy information the user's program has stored next to
the receive buffer.  What has to happen in this case is that any part of
an incoming buffer that does not completely fill a virtual memory page
must be copied into the user's buffer.  We did look at a number of tests
in which we used a fairly large buffer size (60144 and 49152) that
wasn't a multiple of 4 KB.  The results were fairly consistent: if you
forced a copy of data because your buffer was not exactly a multiple of
4 KB, the performance fell to about 27-28 MB/second, which still isn't
bad when compared to the competition.  As for the TCP_NODELAY feature of
ttcp, we were a little surprised that the -D switch didn't make more of
a difference than observed.

Copying Files over IRIS HIPPI:
==============================

While blasting data in HIPPI-PH mode and doing memory-to-memory
transfers with TCP/IP are interesting, there is a great deal of interest
in disk-to-disk transfers with rcp and ftp.  While both of these
routines use TCP/IP, they also use the disks, which adds another level
of complexity, i.e., disk performance.

So far, we have had five or six users of IRIS HIPPI call and complain
about ftp performance in particular.  They usually couldn't get more
than four or five megabytes per second.  Our first question was always
`How did you stripe your disks?'  Inevitably, the answer was `Huh?'
That is to say, what they really were doing was copying a file from an
unstriped disk over IRIS HIPPI to another disk which was unstriped.
Since the typical SCSI-II disk drive can read or write at between four
and five megabytes per second, what they were actually doing was simply
measuring how fast the disk drives could do I/O.  Therefore, it is
mandatory to use striped disks to get really good disk-copy performance
over IRIS HIPPI.

Originally, we planned to get the largest disk farm we could to do these
tests.
Unfortunately, a number of outside pressures limited us to a total of
twelve disks.  These were striped three disks each on two SCSI
controllers, so we ended up with six-way striping.  Under ideal
conditions, we can often get about 14 MB/sec off of a SCSI controller
which has three disks on it, so it is likely that we could get 20 to 25
megabytes per second off of the six-way striped configuration.  As it
turned out, there were other problems that we ran into.

Copying Files with rcp.
=======================

While most people who buy HIPPI interfaces tend to use ftp for copying
files, we decided to look at rcp as well.  We'll report those results
first, and then go into ftp a good deal deeper.

Our rcp copying tests were fairly simple.  We used mkfile to create a
number of files, varying in size from 1 megabyte to 1 gigabyte.  Then we
ran rcp to copy them over IRIS HIPPI, using timex to see how long each
copy took.  Next, we did a divide and, voila, we had how many megabytes
per second rcp copied the files.  This was done both on the 6-way
striped disks and on single unstriped disks, to underscore that you
really need to stripe your disks if you want performance.

The results are in Table 3; however, before we get to the actual
numbers, we had better note that we did transform the numbers slightly
so that they are comparable to ftp's reported performance.  As many of
you know, ftp reports its performance in KB (1024 bytes), but the number
printed is in base ten.  In other words, a reported 1000 KB, or a
`megabyte', is really 1000 * 1024 bytes.  Therefore, we converted the
rcp performance to the same format so that apples-to-apples comparisons
with ftp performance are possible.

Table 3 shows the file copy performance of various sized files over IRIS
HIPPI using rcp on 6-way striped and unstriped disks.  As noted above,
the data was converted to be comparable to ftp performance results.
    File Size                6-Way Striped     Unstriped

      1 MB (1024*1024)       1.11 MB/second    1.04 MB/second
     10 MB (10*1024*1024)    5.59 MB/second    4.36 MB/second
    100 MB (100*1024*1024)   7.32 MB/second    4.59 MB/second
      1 GB (1000*1024*1024)  7.45 MB/second    4.58 MB/second

        Table 3  File copy performance of various sized files over
                 IRIS HIPPI using rcp on 6-way striped and unstriped
                 disks

Quite clearly, the results show that striping the disks is important.
The performance for unstriped disks is limited to the disk's speed.  In
the case of 6-way striping, we got as much as 7.45 MB/second using rcp.
The lower performance shown for the 1 MB and 10 MB files is due both to
TCP's self-regulation features (slow-start, MTU discovery, etc.) and to
the rather lengthy set-up handshaking (user validation, etc.) that
occurs between rcp and rcpd.  It is safe, however, to say that rcp can
get about 7.45 MB/second, at least with a striped disk as tested.

While 7.45 MB/second for the six-way striped disks is a lot better than
the 4.5 MB/second number obtained from the unstriped disk, it is still
nothing like the 20 to 25 MB/second that we could have gotten.  The
question is why?  We explored this when we tested ftp performance on the
same disks and files we used to test rcp.

Copying Files with ftp.
=======================

Obviously, we were a little disappointed with the file copying
performance we saw with rcp.  So the first thing we did was to repeat
the test with ftp, using the same disk configurations and file sizes we
used in the rcp test.  The results are shown in Table 4.  We refer to
this version of ftp as `standard', for it is the version shipped with
IRIX 5.2.  As you will see, we ended up modifying ftp.
    File Size                6-Way Striped     Unstriped

      1 MB (1024*1024)       3.97 MB/second    3.35 MB/second
     10 MB (10*1024*1024)    4.62 MB/second    4.01 MB/second
    100 MB (100*1024*1024)   5.91 MB/second    4.38 MB/second
      1 GB (1000*1024*1024)  7.07 MB/second    4.45 MB/second

        Table 4  File copy performance of various sized files over
                 IRIS HIPPI using standard ftp on 6-way striped and
                 unstriped disks

Oh, well, so ftp is not any better than rcp.  If anything, it has
slightly worse performance.  Please note that the apparent improvement
in the 1 MB file case, where ftp transferred data at 3.97 MB/second, is
really because rcp does the set-up handshaking we noted, and ftp
doesn't.

Once again, we were less than totally pleased with the results.  After
all, we have seen disk-to-disk copying done at much higher speeds than 7
MB/second when six-way striped disks were used.  And we also know that
ttcp can run at nearly 48 MB/second over IRIS HIPPI.  However, those
disk-file copying tests were not done with cp, but with special-purpose
programs that took advantage of the features of IRIX.  Perhaps the
problem is in ftp.  As it turned out, it was.

About this time, we decided to have a hard look at ftp.  That is, we dug
the sources out and looked at what it was doing.  And, guess what?  It
was doing its I/O, both to and from disk as well as to TCP/IP, in 16 KB
blocks.  While this is adequate for, say, copying files over the
internet, it is nothing like the software needed to drive megabytes over
the HIPPI interface.  Careful examination of ftp's sources disclosed a
number of other flaws that would limit disk as well as network
performance.  However, the biggest problem was the 16 KB I/O being used.

So, we got out our software meat axe and hacked ftp a bit.  Actually,
all we did was increase the buffer size definition, which, in turn,
increased the nbyte parameter in the reads and writes, because they were
defined in terms of sizeof(buffer).
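The shape of the hack can be sketched roughly as follows; FTP_BUFSIZ,
iobuf, and copy_stream are our illustrative names, not the actual ftp
source:

```c
#include <unistd.h>
#include <fcntl.h>

#define FTP_BUFSIZ (60 * 1024)   /* the stock source used 16 * 1024 */

char iobuf[FTP_BUFSIZ];

/* Copy from one descriptor to another.  Because the read and write
 * lengths follow sizeof the buffer, raising the one #define raises
 * the I/O size on both the disk and the network side at once.
 * Returns the number of bytes copied, or -1 on error. */
long copy_stream(int in, int out)
{
    long total = 0;
    ssize_t n;

    while ((n = read(in, iobuf, sizeof iobuf)) > 0) {
        if (write(out, iobuf, (size_t)n) != n)
            return -1;
        total += n;
    }
    return (n < 0) ? -1 : total;
}
```

Since every read and write takes its size from sizeof the buffer,
bumping the single definition changes the I/O size everywhere at once,
which is essentially all our modification did.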
Thus, it was very easy to change the I/O size being used, not only to
and from disk, but over the IRIS HIPPI interface as well.  We chose two
new buffer sizes: 60 KB and 960 KB.  The 60 KB size was chosen because
it is the sweet spot we found with ttcp in Table 2, and 960 KB is 16
times as large.  In hindsight, we should have used even bigger buffer
sizes.

First, the results for the 60 KB version of ftp.  Both ftp and ftpd were
modified as noted above to do I/O in 60 KB buffers.  Then we repeated
the testing we did on the standard version of ftp.

    File Size                6-Way Striped     Unstriped

      1 MB (1024*1024)       3.54 MB/second    3.41 MB/second
     10 MB (10*1024*1024)    7.21 MB/second    4.23 MB/second
    100 MB (100*1024*1024)   7.79 MB/second    4.43 MB/second
      1 GB (1000*1024*1024)  7.72 MB/second    4.39 MB/second

        Table 5  File copy performance of various sized files over
                 IRIS HIPPI using ftp with 60 KB buffers on 6-way
                 striped and unstriped disks

That's better!  It isn't a whole lot better than rcp, but it does show
that the performance problem is not in IRIS HIPPI but in ftp itself.
So, let's try the 960 KB buffer size.  Both ftp and ftpd were modified
to do I/O in 960 KB buffers.

    File Size                6-Way Striped     Unstriped

      1 MB (1024*1024)       3.32 MB/second    3.29 MB/second
     10 MB (10*1024*1024)    7.69 MB/second    4.32 MB/second
    100 MB (100*1024*1024)   9.51 MB/second    4.53 MB/second
      1 GB (1000*1024*1024)  9.52 MB/second    4.28 MB/second

        Table 6  File copy performance of various sized files over
                 IRIS HIPPI using ftp with 960 KB buffers on 6-way
                 striped and unstriped disks

Still better, but not quite 10 MB/sec.  However, it is now very clear
that there are still other problems in ftp that need to be solved if we
are to get warp-speed performance.  We looked at them and decided to
leave them alone, as it would take a major rewrite of ftp to do what was
really necessary.  This consisted mainly of redoing the disk I/O
buffering, and we frankly didn't have the time.
However, there were a few things we could do to get some idea of just
what performance we could eventually get.  One was to cheat on reading
the disk by using two copies of ftp and ftpd to read and write the same
files at the same time.  This is a bogus measure of disk I/O, because
EFS simply reads the second stream of I/O out of the disk cache on the
transmitting system.  However, it nevertheless sends two streams of data
through TCP/IP and IRIS HIPPI, so we could still measure the transfer
rate over the network.  We got an aggregate of 15.66 MB/second when we
did this.

Summary of ftp Performance.
===========================

Thus, in summary, ftp performance is really limited by the way the code
is written.  Clearly, much higher file transfer rates are possible when
even simple things like increasing the I/O buffer sizes are done.